Annotation of genome sequences

ABSTRACT

A method of identifying one or more proteins in an unannotated DNA sequence is disclosed. The method involves dividing the DNA sequence into a plurality of sequence fragments of substantially the same length (about 300 to 5000 base pairs, most typically 1000 to 1050 base pairs). A six frame translation is then performed on each of the DNA sequence fragments to obtain six translated amino acid sequence fragments for each DNA sequence fragment. Each of the translated sequence fragments is subjected to theoretical digestion to obtain a plurality of cleaved peptide sequences. Next, experimental empirical data for peptide fragments from a protein digested in the same manner as the theoretical digestion are compared with the theoretical data generated for each of the translated sequence fragments to identify one or more translated sequence fragments which include a substantial number of peptides present in the digested protein. The sequence fragment which has the greatest number of theoretical peptide masses correlating to the empirical data indicates the likely location of the protein of interest in the DNA sequence. To avoid the problem where the sequence is divided at the site of a protein, the DNA sequence is duplicated and the original and duplicate are split in such a manner that the sequence fragments from the duplicate overlap the cuts in the original genome sequence.

FIELD OF THE INVENTION

This invention relates to a method of annotation of genome sequences.

BACKGROUND OF THE INVENTION

Many genomes, including the human genome, have now been sequenced. A genome sequence provides a list of bases (A, T, G, C) in the order in which they appear in a length of DNA. However, the sequence per se tells one very little about the genome that is useful and easily or immediately comprehensible. For example, in the study of a disease-causing bacterium it would be useful, in searching for a cure for the disease, to determine the location of that part of the bacterium's genome which expresses a particular protein. However, it can be difficult to predict where proteins of interest may be located in a genome sequence. It cannot always be done simply by looking at the sequence per se.

There are a number of known processes for attempting to determine the location of proteins in genome sequence data. The most widely used methods for annotation are pattern searching and sequence comparison techniques. One other known method uses computer programs to locate recognisable regions such as start codons and stop codons in a DNA sequence. Other programs attempt to locate proteins by locating regions of high complexity within a DNA sequence, which typically indicate the location of a protein.

However, these approaches are far from perfect because, in order to implement these programs, various assumptions and hypotheses have to be made about the location of a protein of interest in the DNA sequence, in particular the potential start and stop positions of the protein. A detection method that requires such assumptions or hypotheses may produce incorrect results if the assumptions or hypotheses are incorrect. For example, these procedures are unlikely to locate non-typical sequences, which ironically may be of more interest than other proteins having more typical sequences identified using existing techniques.

Thus, it is one object of the present invention to provide a method for annotating genome sequences which is hypothesis independent and does not make assumptions for the detection of a protein from nucleic acid sequences.

Any discussion of documents, acts, materials, devices, articles or the like which has been included in the present specification is solely for the purpose of providing a context for the present invention. It is not to be taken as an admission that any or all of these matters form part of the prior art base or were common general knowledge in the field relevant to the present invention as it existed in Australia before the priority date of each claim of this application.

SUMMARY OF THE INVENTION

A first broad aspect of the present invention provides a method of identifying one or more proteins in an unannotated DNA sequence, the method comprising:

(a) dividing the DNA sequence into a plurality of sequence fragments, each fragment being of substantially the same length and from about 300 to 5000 bases long;

(b) performing a six frame translation of each of the DNA sequence fragments to obtain six translated amino acid sequence fragments for each DNA sequence fragment;

(c) subjecting each of the translated sequence fragments to theoretical digestion to obtain a plurality of cleaved peptide sequences;

(d) comparing experimental empirical data for peptide fragments from a protein digested in the same manner as the theoretical digestion at step (c) with the theoretical data generated in step (c) for each of the translated sequence fragments to identify one or more translated sequence fragments which include a significant number of peptides present in the digested protein.

Thus the present invention identifies a region of a genome that encodes a protein and optimally defines the open reading frame and therefore the sequence of the protein from the genome. An advantage of the present invention is that no assumptions need to be made about the location of proteins in the DNA sequence data. DNA sequences with non-typical stop and/or start codons may be located. The results are hypothesis independent.

Typically the theoretically generated peptide masses are compared to the masses of the peptides experimentally generated by the digested protein, and the sequence fragment which has the greatest number of theoretical peptide masses correlating to the empirical data indicates the likely location of the protein of interest in the DNA sequence. The masses of the peptides experimentally generated from the digested protein will typically be determined by mass spectrometry.

It is preferred that the DNA sequence is duplicated and the original and duplicate are split in such a manner that the sequence fragments from the duplicate overlap the cuts in the original genome sequence.

It is important that the sequence fragments are approximately the same length as one another and are sized to equate to the length of a typical protein. Hence, each fragment is, as discussed above, about 300-5000 bases long. Proteins vary in size, most proteins being 10 to 100 kDa, i.e. about 300-3000 bases long. Most preferably, the sequence fragments will be around 1000 or 1050 bases long, the latter translating to 350 amino acids, which is approximately equivalent to a 33 to 37 kDa protein, a common size for a protein.

Using DNA sequences of approximately that length produces about 12 to 20 peptide matches for a sequence containing the protein, against a background of commonly around 1 or 2 matches, and up to around 4, for sequences which do not contain a protein.

In a related aspect of the present invention, the step of dividing the DNA sequence and the step of performing the six frame translation can be reversed. Hence, a second broad aspect of the present invention provides a method of identifying one or more proteins in an unannotated DNA sequence, the method comprising:

(a) performing a six frame translation of a DNA sequence to provide six translated amino acid sequences;

(b) dividing the six translated amino acid sequences into a plurality of fragments, each fragment comprising 100-1666 amino acids;

(c) subjecting each of the fragments to theoretical digestion to obtain a plurality of cleaved peptide sequences;

(d) comparing experimental empirical data for peptide fragments from a protein digested in the same manner as the theoretical digestion at step (c) with theoretical data generated in step (c) for each of the fragments to identify one or more fragments which include a significant number of peptides present in the empirically digested protein.

BRIEF DESCRIPTION OF THE DRAWINGS

A specific embodiment of the present invention will now be described by way of example with reference to the accompanying drawings.

FIG. 1 is a flow chart depicting an overview of the process described in this patent application.

FIGS. 1A to 1E are schematic diagrams illustrating various steps in the method of the present invention.

FIG. 2 is a more detailed flow chart depicting the part of the process involving the segmentation of the genome.

FIG. 3 is a more detailed flow chart depicting the part of the process involving the translation and theoretical digestion of the genomic segments.

FIG. 4 is a detailed flow chart depicting the part of the process involving the identification of the region of the genome after the peptide mass fingerprinting is complete.

FIG. 5 shows an example of the method in operation using experimental data derived from a spot on a 2D gel of a sample from Mycobacterium tuberculosis. The figure identifies the region of the genome coding for this protein as the portion extending over segments 800 to 803. The number of matches or “hits” associated with these segments is distinctly higher than the background number of hits (less than 6).

FIG. 6 shows a detailed view of segment 801 from the search described in FIG. 5, showing the match between specific experimental masses and individual peptides from the theoretical digestion of this segment of the genome. Comparison with the SWISS-PROT database using BLAST shows this region is the coding region for the protein.

FIG. 7 shows a second example of the method in operation on experimental data derived from a different spot from the same sample described in FIG. 5. The figure identifies two potential coding regions (one involving segments 7308 and 7309 and the other involving segments 8290 and 8291). As the number of matches is not substantially above the background, further information is required to confirm that these are coding regions.

FIG. 8 shows a detailed view of segment 7309 from the search described in FIG. 7, showing all but one of the peptide matches are located in a contiguous region of amino acids between two stop codons. This confirms this segment is a coding region. Comparison with the SWISS-PROT database using BLAST shows this region is the coding region for the protein.

FIG. 9 shows a detailed view of segment 8290 from the search described in FIG. 7, showing all but one of the peptide hits are located in a contiguous region of amino acids between two stop codons. This confirms this segment is a coding region. Comparison with the SWISS-PROT database using BLAST shows this region is the coding region for the protein.

FIG. 10 shows a detailed view of segment 318 from the search described in FIG. 7, showing all but two of the peptide hits are separated from each other by stop codons. This confirms this segment is not a coding region.

FIG. 11 shows a graph depicting the results of a simulation to demonstrate the effectiveness of the method. On average, for all proteins in Pseudomonas aeruginosa, the best hit, corresponding to the coding region, has more peptide hits than the nearest incorrect hit. This distinction is particularly evident in large proteins but decreases as the proteins become smaller.

FIG. 12 is a graph depicting the effect of changing the segment size on the average best and nearest incorrect hits. As the size of the segments increases, the distinction between the two curves increases. The effect on the best hit is limited by the size of the protein. Once the protein is smaller than the size of the segment, there is no longer any effect on the “best hit” curve.

FIG. 13 shows a figure depicting the definition of the best and second best hit. The nearest incorrect hit is the segment having the most hits when the segments overlapping the best hit are ignored. This is a necessary distinction because segments adjacent to the top hit will often have a large number of hits, since the protein sequence extends across multiple segments.

FIG. 14 shows an example of the application of the method to Homo sapiens. The theoretical digestion of Apolipoprotein L5 (Q9BWW9) was searched against the genomic data from chromosome 22 of H. sapiens. The figure identifies a potential coding region involving segments 36302 and 36303. As there are a number of other matches with similar numbers of hits, further information is required to confirm this as a coding region.

FIG. 15 shows a detailed view of segment 8866 from the search described in FIG. 14, showing the large number of hits is artificial because one experimental mass has matched several separate points on the segment, as this segment contains a repeat region. All matching segments except 36302 and 36303 were similar in that they involved repeat regions.

FIG. 16 is a detailed view of segment 36302 from the search described in FIG. 14, showing the match between specific experimental masses and individual peptides from the theoretical digestion of this segment of the genome. This confirms these segments are a coding region, and comparison with the SWISS-PROT database using BLAST shows this region is the coding region for Q9BWW9.

DETAILED DESCRIPTION OF A PREFERRED EMBODIMENT

Referring to the drawings, FIG. 1 is a flow chart showing an overview of the method of the present invention. The first step 20 involves the acquiring of a genome sequence. In the next step 22, the genome sequence is split into overlapping fragments. Next, at step 24, the fragments are translated in six frames and at step 26 a theoretical digest of the protein sequence fragments generated by the six frame translation is carried out. Step 28, which is independent of the theoretical treatment of the genome sequence shown in boxes 20 to 26, is the acquiring of experimental peptide masses, typically by mass spectrometry. The next step 30 involves the comparison of the experimentally determined peptide masses with the theoretical masses. Step 32 is the process of identifying the best hits, and step 34 is the step of identifying the genome region corresponding to the protein. The process is shown diagrammatically in FIGS. 1A to 1E.

FIG. 1A shows a genome sequence 10 which is taken and split into a series of shorter genome sequences or sequence fragments 12. Overlapping sequences are preferably provided by duplicating the genome sequence and cleaving the duplicated sequence at locations midway between the breaks in the original sequence so that the sequences (12 a, 12 b . . . , 14 a, 14 b . . . ) are overlapping as shown in FIG. 1A.

The segments are overlapped to facilitate the process of identifying the region of the genome coding for the protein of interest. In some cases, the peptide masses from the protein of interest could be distributed across two adjacent segments, with a portion of the peptide masses at the end of one segment and a second portion at the start of the next segment. This means the number of peptide masses on each of the two segments will be closer to the background number of random, “noise” matches found on the remaining segments, making it harder to identify the hit. However, by using overlapping segments, the peptides at the end of one segment and the start of the next will all be located on the common, overlapping segment. This means the number of peptides on the common, overlapping segment will be further from the background number of random, “noise” matches, making it easier to identify this segment as the correct location of the protein-coding region in the genome.

In principle, the overlap is not absolutely necessary for the method to work, but it is significant in distinguishing a hit from background “noise”, particularly in the case of relatively small proteins. For example, if overlapping were not used and a relatively small protein fell equally between two adjacent segments, only three or four hits might be obtained for each segment. This would not be distinguishable over the background “noise” of typically about 4 hits, so it would not identify the protein. Using overlapping segments, there is a good chance the smaller protein would fall in a single fragment, and the number of hits would be maximised and so facilitate the identification.

Typically, the genome will be cut into sequence fragments which are 1050 bases long. This approximates to 350 amino acids, which will be found in a protein of around 33 to 37 kDa, a common protein size. A bacterium such as Mycobacterium tuberculosis (Tb) will have around 4.4 million bases in its genome. Duplicating and cutting that genome will result in approximately 8400 sequence fragments.
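The fragment count quoted above can be checked with a few lines of arithmetic. The following is a minimal sketch only: the function name count_fragments is hypothetical, and it assumes the duplicate is cut starting half a fragment length into the genome, following the "midway" description of FIG. 1A.

```python
import math

def count_fragments(genome_length, segment_length=1050):
    """Count fragments from the original cut plus a half-offset duplicate cut.

    Assumption: the duplicate starts segment_length // 2 bases into the genome.
    """
    offset = segment_length // 2
    original = math.ceil(genome_length / segment_length)
    shifted = math.ceil((genome_length - offset) / segment_length)
    return original + shifted

total = count_fragments(4_400_000)
# Each fragment yields six virtual proteins after six frame translation.
print(total, total * 6)
```

For a genome of 4.4 million bases this gives a little under 8400 fragments, consistent with the approximate figure stated in the text.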

FIG. 2 shows a flow diagram depicting an algorithm for carrying out the part of the process involving segmentation of the genome. The first step 40 involves the acquisition of a genome sequence and the user defined length “x” of the segments into which the genome is to be cut, x typically being 1050 base pairs. The first x bases from a starting point at one end of the genome sequence are then acquired at step 42 to create a genomic segment x bases long at step 44. Next, at step 46, a check is carried out to see if there are any more base pairs in the genome sequence. If the answer is yes, the next x bases are removed at step 42 again to create a second genomic segment, and so on until there are no more base pairs in the genome sequence and the entire sequence has been segmented. When there are no more base pairs in the genome sequence, the algorithm moves to step 48, where a new starting point at base number A is identified. The next x bases from that starting point are then removed at step 50 and used to create a genomic segment, step 52, and the process is repeated, step 55, until there are no more base pairs in the genome sequence. For ease of analysis the first set of segments are numbered 1, 3, 5, . . . n+2, . . . and the second set of fragments overlapping the first are numbered 2, 4, . . . n+1, . . . which ensures that the fragment overlapping two fragments n and n+2 is n+1. This indicates where segments are relative to each other in a readily understandable way and makes it easier to interpret the results.
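The segmentation and numbering scheme of FIG. 2 might be sketched as follows. This is an illustration under stated assumptions rather than the algorithm of the figure itself: the function name segment_genome is hypothetical, and the second starting point (base number A in the text) is assumed to default to half the segment length.

```python
def segment_genome(genome, x=1050, offset=None):
    """Cut a genome into two interleaved sets of fixed-length segments.

    The first pass starts at base 0 and is numbered 1, 3, 5, ...
    The second pass starts at `offset` and is numbered 2, 4, 6, ...
    so that segment n+1 overlaps segments n and n+2.
    Assumption: offset defaults to x // 2 (the text does not fix its value).
    """
    if offset is None:
        offset = x // 2
    first = [genome[i:i + x] for i in range(0, len(genome), x)]
    second = [genome[i:i + x] for i in range(offset, len(genome), x)]
    numbered = {}
    for k, segment in enumerate(first):
        numbered[2 * k + 1] = segment   # odd numbers: original cut
    for k, segment in enumerate(second):
        numbered[2 * k + 2] = segment   # even numbers: overlapping cut
    return numbered

segments = segment_genome("ACGT" * 5, x=8)
for number in sorted(segments):
    print(number, segments[number])
```

The final segment of each pass may be shorter than x; a real implementation would need to decide whether to keep or pad it.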

The genome is segmented to enable easier identification of the protein-coding region of the genome. The genome is segmented into fixed sections, regardless of the length or possible location of the protein-coding regions. Hence, the number of background or random matches to the peptide masses is reasonably constant, and this then helps to identify the protein-coding regions. When the number of matches against a region exceeds the number of random matches on other segments, a protein-coding region is indicated.

If the genome were not segmented, it would be difficult to determine when a concentration of hits was indicative of a protein-coding region. It would be necessary to look for a certain number of hits in a region of a certain length, but the exact values of these parameters would need to be pre-determined and may affect the results.

Each segment of the genome simulates a protein (the translation of a certain region of a genome). By segmenting, the peptide mass analysis becomes analogous to peptide mass fingerprinting (PMF). This allows the use of a number of existing PMF search engines to do the analysis. Most advantageously, the present invention addresses the very complex problem of mining genomes with proteomic data but presents the results in a way which is completely familiar and highly understandable to the proteomics researcher, and which does not require the researcher to learn a new tool or paradigm.

Further, segmenting the genome has advantages in terms of computational performance. In particular, working with a whole genome at once is likely to be demanding in terms of computer memory. Smaller segments can be analysed sequentially and thus require less memory at any particular point in the calculation.

A six frame translation is then carried out on each of the sequence fragments. FIG. 1B schematically illustrates a six frame translation carried out on one of the sequence fragments (14 d). Six frame translation is a well understood term for the translation of a given nucleotide sequence to the peptide sequence in accordance with the universal genetic code, with the translation being done in all three reading frames and in the forward and reverse directions. For each fragment, six virtual proteins are produced. Fragment 14 d produces six virtual proteins 16 a-16 g. Using the M. tuberculosis example referred to above, the 8400 sequence fragments become 50,400 virtual proteins. These virtual proteins are then subjected to theoretical digestion according to rules which mimic the action of an endoproteinase enzyme such as trypsin, which cuts at specific target sites on a target sequence. In a preferred embodiment of the theoretical digest, all theoretical peptides which contain a stop codon are removed; however, the mass of the theoretical protein is calculated from the N terminus of the peptide up to and including the amino acid N terminal to the stop codon. This reduces background noise. This digestion is schematically illustrated in FIG. 1C. Each virtual protein becomes a series of “virtual peptides” and the mass of each virtual peptide is calculated. “Protein” 16 g becomes six peptides 18 a to 18 g. Fewer or more peptides may be produced from each virtual protein. The protein of interest is then subjected to an empirical digestion using the same enzyme, and peptide mass data is obtained from mass spectrometry of the peptides produced from that protein. FIG. 3 is a flow chart depicting the part of the process involving the translation and theoretical digestion of the genomic segments.
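The translation and digestion steps described above can be illustrated with a short sketch. All function names here are hypothetical. The code assumes the standard genetic code, models trypsin with the common simplified rule of cleaving after K or R unless the next residue is P (the text does not specify the cleavage rules), and simply discards peptides containing a stop codon rather than reproducing the truncation rule of the preferred embodiment.

```python
import re
from itertools import product

# Standard genetic code, codons enumerated in TCAG order; '*' marks a stop.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def translate(dna):
    """Translate one reading frame; unknown codons (e.g. with N) become 'X'."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))

def six_frame_translation(dna):
    """Return six virtual proteins: three forward frames, then three reverse."""
    dna = dna.upper()
    reverse = dna.translate(COMPLEMENT)[::-1]
    return [translate(strand[frame:])
            for strand in (dna, reverse) for frame in range(3)]

def tryptic_digest(protein):
    """Cleave after K or R (not before P); drop peptides containing a stop.

    Assumption: dropping stop-containing peptides is a simplification.
    """
    peptides = re.split(r"(?<=[KR])(?!P)", protein)
    return [p for p in peptides if p and "*" not in p]

frames = six_frame_translation("ATGAAATAG")
print(frames)
print(tryptic_digest("MKAAR*GGKPLR"))
```

Each of the six strings returned by six_frame_translation corresponds to one virtual protein, and tryptic_digest turns a virtual protein into its virtual peptides.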

The masses of the various empirically derived peptides are then compared with the theoretical peptide masses produced by theoretical cleavage of the sequence fragments. This is done in a stepwise manner and frame by frame, whereby all the empirical peptide masses are matched against all peptides from the first virtual protein and the number of matching peptides (matches or “hits”) is recorded. For each virtual protein, this process is carried out six times, once for each of the amino acid translations. However, the number of matches for each frame is calculated separately and the matches are not summed together. This process is then repeated for the second virtual protein and so on, until it has been carried out for all the virtual proteins. This step is illustrated in FIG. 1D. There is a background number of matches. Typically, each theoretical protein or sequence fragment will produce 1 or 2 matches, with a maximum of about 3 or 4 peptides having masses which correlate to masses produced by the actual empirical digest of the protein of interest. The sequence fragment which produced the protein of interest will, in contrast, typically have about 12 to 20 peptide matches with the empirical digest of the protein of interest, but this is limited by the number of peptides generated empirically. FIG. 4 is a flow chart illustrating this process.
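The frame-by-frame comparison can be sketched as a simple hit count within a mass tolerance. The names below are hypothetical, and the sketch assumes neutral monoisotopic peptide masses (standard residue masses plus one water) with no modifications; experimental MALDI peaks would first need converting to the same convention.

```python
WATER = 18.01056
# Standard monoisotopic residue masses in Da.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}

def peptide_mass(peptide):
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER

def count_hits(experimental, theoretical, tolerance=0.1):
    """Number of experimental masses matching any theoretical mass."""
    return sum(1 for m in experimental
               if any(abs(m - t) <= tolerance for t in theoretical))

def score_segments(experimental, segment_peptides, tolerance=0.1, min_hits=4):
    """Score each (segment, frame) separately; frames are never summed.

    segment_peptides maps (segment_number, frame) to a list of peptides.
    Peptides with residues outside the 20 standard ones are skipped.
    """
    scores = {}
    for key, peptides in segment_peptides.items():
        masses = [peptide_mass(p) for p in peptides
                  if all(aa in RESIDUE_MASS for aa in p)]
        hits = count_hits(experimental, masses, tolerance)
        if hits >= min_hits:
            scores[key] = hits
    return scores
```

The min_hits argument plays the role of the "minimum to match" used in the examples, and tolerance corresponds to the error tolerance in Da.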

Clearly the relevant part of the genome sequence may have been cut in the original division of the genome sequence; however, the overlapping of the original and duplicate genome sequences reduces the risk of this. Even if the protein is split, it may still be possible to identify the relevant part of the genome sequence if there are a reasonable number of hits, e.g. 6 to 10, in two adjacent overlapping fragments. The part of the sequence which carries the most peptide masses matching the peptide masses produced by the empirical digestion, and which has a number of hits clearly above the background (noise) level, is likely to be that part of the genome which carries the protein of interest. Knowing where that part of the sequence came from identifies the location of the protein in the genome sequence (FIG. 1E).

EXAMPLE A (i)

FIGS. 5 to 10 illustrate the results of carrying out the method of the present invention.

A culture of Mycobacterium tuberculosis was used as the source of proteins for experimental analysis. The sample was prepared and the proteins separated using 2D gel electrophoresis. A number of spots were cut from the gel, digested with trypsin, and the peptides resulting from the digestion were analysed with MALDI mass spectrometry. These peaks were analysed using standard peptide mass fingerprinting to identify the proteins contained in each spot.

The genome of M. tuberculosis was segmented into 1050 base pair segments, translated, and theoretically digested using the process described above. The peaks were searched against the genome using the method of the present invention as described above.

The peaks from a first spot were searched with 0.1 Da error tolerance, allowing for cysteines to be modified by iodoacetamide and for methionine sulfoxide modifications, and a minimum to match of four hits.

FIG. 5 shows a summary of the results illustrating all the theoretical sequence fragments which produced four hits or more. Four consecutive segments (800-803) received 10, 12, 12, and 6 hits respectively. All other segments had fewer than 6 hits. This indicates the protein found on the gel matches the region of the genome stretching across these four segments. The protein sequence of segment 801, shown in FIG. 6, was compared to all the proteins in the SWISS-PROT database using BLAST. The protein was thus identified as “Chaperone protein dnaK (P32723)”. This protein of molecular weight 66.7 kDa exactly matches the identification determined by standard peptide mass fingerprinting, indicating that the method described in the patent application correctly identified the region of the genome coding for the protein of interest.

EXAMPLE A (ii)

A second spot from the gel was then searched. FIG. 7 is a summary of the results. The peaks from the second spot were searched with the same parameters described above, except that a value of five hits was used as the minimum to match. Two regions of interest were found. The first involved segments 7308 (6 hits) and 7309 (8 hits); the second involved segments 8290 (7 hits) and 8291 (6 hits). There was one other segment with 6 hits. All the other segments had fewer than 6 hits. This is illustrated in FIG. 7. The portion of the protein sequence between two stop codons having the most hits was, in each case, submitted to BLAST as described above. The first region, shown in FIG. 8, was identified as “10 kDa chaperonin (P09621)”. That this is a good result is indicated by the fact that the peptides all occur in a region of consecutive amino acids with no stop codons. Another indicator of a valid result is the presence of an initiation methionine. However, it is to be noted that in this case there is no initiation methionine in this area. This indicates that either a non-standard start codon is being used or there is an error in the genome sequence. This open reading frame would not have been detected using the standard prior art techniques, which demonstrates the usefulness of the approach of the present invention. The second region, shown in FIG. 9, was identified as a “10 kDa culture filtrate antigen cfp 10 (O69739)”. This clearly does include an initiating methionine, which neatly defines the open reading frame for the protein.

Both these proteins were found in this spot using standard peptide mass fingerprinting. These proteins did not stand out as clearly as in the previous spot, but were still identifiable. This demonstrates the process described in the patent application can also work when multiple proteins are located in the one spot and when the proteins being searched for are relatively small.

An incorrect hit is shown in FIG. 10 for comparison. Factors which point to it being an incorrect hit are that there is no obvious initiation methionine present, and there are frequent stop codons present in the reading frame.

EXAMPLE B

The method can be applied to higher order genomes including the human genome. To demonstrate this, the genome sequence of chromosome 22 of Homo sapiens was prepared and searched using the method described above. A theoretical peak list was generated using the sequence of Q9BWW9 (Apolipoprotein L5), known to be located on chromosome 22. This peak list was searched against the genome using the method described in the patent application, with an error tolerance of 0.1 Da and a minimum to match of 10. FIG. 14 shows the result of this search. There were 12 hits with between 10 and 23 matches. Examining the details of each of these in turn shows all except two of these hits involve matches to repeat regions in the genome, i.e. the same peptide occurs multiple times, resulting in an artificially high number of matches. This is shown in FIG. 15. The remaining two hits are on overlapping segments. One of these is shown in FIG. 16. Comparing the sequence of this segment to all the proteins in SWISS-PROT using BLAST identifies the protein as Q9BWW9.

COMPARATIVE EXAMPLES

A series of computational simulations were run in order to demonstrate the method and determine the optimum parameters for the method. The simplest simulation involved taking the set of known proteins for Pseudomonas aeruginosa. The set of 773 known proteins was taken from SWISS-PROT. Each protein was theoretically digested according to the cleavage rules of trypsin. Tryptic peptides whose mass was less than 400 Da were discarded, as these masses are not usually seen on a typical MALDI mass spectrum. The remaining tryptic peptides of each protein in turn were searched against the raw genome using the method described in the patent application. The region of the genome coding for the protein was determined by finding the segment with the highest number of matching peptides. The nearest incorrect hit was determined by finding the segment with the next highest number of peptides, excluding those segments connected to the segment with the highest number of peptides through a chain of overlapping segments. This is illustrated in FIG. 13. This allows for the fact that the protein may be longer than one or more segments and thus may have a significant number of hits on adjacent segments.
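The selection of the best hit and the nearest incorrect hit might be sketched as follows. The function name is hypothetical, and the sketch makes two assumptions not spelled out in the text: the input contains only segments that reached the minimum to match, and a "chain of overlapping segments" is read as a run of consecutive segment numbers in that input. Reading frames are ignored for brevity, and the hit counts in the usage lines are made-up illustrative values.

```python
def best_and_nearest_incorrect(hits):
    """Return (best_segment, nearest_incorrect_segment).

    hits maps segment number to hit count, for reported segments only.
    Segments reachable from the best hit through consecutive numbers
    (n-1, n+1, ...) are treated as part of the same coding region.
    """
    best = max(hits, key=hits.get)
    chain = {best}
    for step in (-1, 1):
        n = best + step
        while n in hits:          # walk outwards along the overlap chain
            chain.add(n)
            n += step
    others = {s: h for s, h in hits.items() if s not in chain}
    nearest = max(others, key=others.get) if others else None
    return best, nearest

# Illustrative values only, not taken from any figure.
example = {10: 9, 11: 14, 12: 8, 40: 5, 41: 4}
print(best_and_nearest_incorrect(example))
```

With these values segments 10 to 12 form one chain around the best hit, so the nearest incorrect hit is taken from the remaining segments.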

In order to summarise this information, the proteins were binned according to the number of tryptic peptides with mass greater than 400 Da generated from them in a theoretical digestion. The first bin contained all proteins with 1 to 10 peptides, the second all proteins with 11 to 20 peptides, etc. The number of matching peptides in the best hit for each of the proteins in the bin was averaged, as was the number of matching peptides in the nearest incorrect hit. These two numbers were plotted as in FIG. 11 to show the difference between the correct hit and the best of the incorrect hits.

The results showed a distinct difference between the best hit and the best of the incorrect hits. The average second best hit has about four to five matching peptides for small query proteins, increasing to around nine to ten matching peptides for larger proteins. For a set of peptides to clearly be identified with a particular region of the genome, they must match more than this number of peptides. This is shown in the figure, where the average number of matching peptides in the best hit is significantly higher than in the second best hit. For large proteins, the average number of peptide matches approaches 25. This number is limited by the size of the segment, as only a certain number of peptides can be expected to fit in the 1050 base pair segment. For smaller proteins, the difference between the first and second hits decreases as there are fewer peptides in the query sequence, but it can be seen that for all but the smallest proteins a difference between the two hits is maintained, with the average number of matches in the best hit around six to seven.

Several variations on the simulation were done to estimate the effect ofdifferent parameters involved in using the method.

1) Increasing the minimum to match, increased the difference between thetwo curves.

In an application of the method described, the minimum to match shouldtake a value between four and nine, as this is the range for backgroundhits determined in the experiment outlined above. Generally, a highvalue would be used first to screen out as much background noise aspossible. This value would be gradually lowered, if necessary, until aregion with a significant matching number of peptides is found.

2) Increasing the size of the segments increases the difference between the two curves. The number of random matches in the second best hit increases slightly, but the number of matches on the best hit increases significantly. A very long segment length is not used because once all query proteins are smaller than the size of the segment no improvement is obtained, and the bigger the segment is the harder it is to locate smaller proteins. In an application of the method described we use 1050 base segments, because this represents a good balance between the two.

3) Changing the composition of the query peak list by adding random peptides has almost no effect on the curves.

In an application of the method described, the peak list is determined by the data extracted from the mass spectrometer. The numbers of real peaks and noise peaks are not known in advance.

4) Decreasing the error tolerance for the match between the query masses and the genome masses increases the difference between the two curves. This is because the query masses are less likely to match another mass in the genome through random chance as the difference in mass tolerated when accepting a match is much smaller.

In an application of the method described, the error tolerance is usually taken in the range of 0.01 to 0.2 Da for experimental masses derived from MALDI mass spectrometry. The value is usually chosen to reflect the accuracy of the technique used to acquire the experimental masses. A typical value is 0.1 Da.
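
The interplay of the error tolerance (variation 4) and the minimum to match (variation 1) can be illustrated with a short sketch. It assumes that the minimum to match is the smallest number of matching peptides a segment needs in order to be reported, and that a query mass matches when it lies within the tolerance of any theoretical mass in the segment. The function names and data layout are hypothetical.

```python
from bisect import bisect_left, bisect_right

def count_matches(query_masses, segment_masses, tolerance=0.1):
    """Count query masses lying within +/- tolerance (Da) of any theoretical mass."""
    ordered = sorted(segment_masses)
    matched = 0
    for q in query_masses:
        lo = bisect_left(ordered, q - tolerance)
        hi = bisect_right(ordered, q + tolerance)
        if hi > lo:  # at least one theoretical mass falls inside the window
            matched += 1
    return matched

def screen_segments(query_masses, segments, tolerance=0.1, start=9, stop=4):
    """segments: {segment_id: [theoretical peptide masses]}.
    Start with a high minimum to match and lower it until some segment qualifies.
    Returns (minimum used, qualifying segment ids), or (None, []) if none qualify."""
    counts = {seg: count_matches(query_masses, masses, tolerance)
              for seg, masses in segments.items()}
    for minimum in range(start, stop - 1, -1):
        hits = sorted(seg for seg, n in counts.items() if n >= minimum)
        if hits:
            return minimum, hits
    return None, []
```

Tightening the tolerance shrinks the window around each query mass, so chance matches become rarer, which is consistent with the behaviour reported for variation 4.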

In an application of the method, the peak list used as input consists of the masses of the proteolytic peptides determined by mass spectrometry. The raw spectrum acquired from the mass spectrometer contains many “noise” peaks. Most of these are removed by using a peak-picking algorithm such as the one outlined in Breen et al. (2000, in press) [Breen, E. J., Hopwood, F. G., Williams, K. L., Wilkins, M. R. (2000) Automatic Poisson peak harvesting for high throughput protein identification, Electrophoresis, 21, 2243-2251; Breen, E. J., Holstein, W. L., Hopwood, F. G., Smith, P. E., Thomas, M. L., Wilkins, M. R. (2003) Automated peak harvesting of MALDI-MS spectra for high throughput proteomics. Spectroscopy. In press.]

In the simulated testing described above, the peaks used were the masses calculated from the sequence of theoretically cleaved peptides. Masses under 400 Da were excluded because a MALDI mass spectrometer cannot generally measure peptide masses in this range.
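
A minimal sketch of calculating peptide masses from sequence and applying the 400 Da cut-off is given below. It assumes neutral monoisotopic masses built from standard residue masses plus one water; the text does not state whether monoisotopic or average masses, or protonated or neutral masses, were used, so these choices and the function names are illustrative only.

```python
# Standard monoisotopic residue masses in Da (assumed mass type; see note above).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # one water is added for the free N- and C-termini

def peptide_mass(peptide):
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER

def measurable_masses(peptides, min_mass=400.0):
    """Return (peptide, mass) pairs, dropping masses below the cut-off."""
    pairs = ((p, peptide_mass(p)) for p in peptides)
    return [(p, m) for p, m in pairs if m >= min_mass]
```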

The implementation of the methods described in the above examples assumes the enzyme used to digest the gel spots is trypsin. This is the most common enzyme used experimentally. Thus the theoretical digestion of the segments is also done using the cleavage rules of trypsin.

The method can use any appropriate enzyme to digest the experimental proteins. In this case the theoretical digestion of the genome segments needs to use the cleavage rules for the enzyme to be used in the experimental analysis.
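
One way to make the theoretical digestion depend on the enzyme is to hold the cleavage rules in a small table, as sketched below. The rule table is an assumption made for illustration: trypsin is taken to cleave after K or R except when the next residue is P (a commonly used rule that the text does not spell out), and Lys-C is taken to cleave after K. The names `CLEAVAGE_RULES` and `digest` are hypothetical.

```python
# Illustrative rules: (residues cleaved after, following residues that block cleavage).
CLEAVAGE_RULES = {
    "trypsin": ("KR", "P"),
    "lys-c": ("K", ""),
}

def digest(sequence, enzyme="trypsin"):
    """Split an amino acid sequence into peptides using the chosen enzyme's rule."""
    cut_after, blocked_by = CLEAVAGE_RULES[enzyme]
    peptides, start = [], 0
    for i, residue in enumerate(sequence):
        nxt = sequence[i + 1] if i + 1 < len(sequence) else ""
        # Cleave after a target residue unless the next residue blocks the cut.
        if residue in cut_after and (nxt == "" or nxt not in blocked_by):
            peptides.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        peptides.append(sequence[start:])  # trailing peptide without a cleavage site
    return peptides
```

Supporting a further enzyme then only requires a new entry in the rule table, provided its specificity can be expressed in this simple form.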

If the experimental analysis is done with multiple enzymes it is possible to use the findings from multiple searches with each of the enzymes to confirm the identification of the region of the genome. If both analyses identify a certain region of the genome as a possible protein-coding region, then the region is more likely to be correctly identified as such. It is possible that each analysis may not have enough hits to be clearly distinguished from the background but because multiple analyses indicate the same region, it can still be identified as the protein-coding region.

In a particular application, a combined search could be implemented where a search is carried out with trypsin and the hits are tallied to each segment, then a search is carried out with other enzymes and hits are tallied to each segment. Finally, the hits to each segment from the two searches are summed to give a composite score per segment. Only hits that are in the same frame are summed. This combined approach would dramatically increase the sensitivity of identification.
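
The composite score can be sketched by keying each tally on the pair (segment, frame), so that hits are only summed when they fall in the same frame of the same segment. The data layout and the function name are assumptions made for this illustration.

```python
from collections import Counter

def combine_hits(hits_by_enzyme):
    """hits_by_enzyme: {enzyme: {(segment, frame): number of matching peptides}}.
    Returns a Counter holding the summed matches for each (segment, frame)."""
    combined = Counter()
    for hits in hits_by_enzyme.values():
        combined.update(hits)  # adds the counts key by key
    return combined

# Example: segment 12, frame 1 is supported by both digests.
scores = combine_hits({
    "trypsin": {(12, 1): 5, (12, 2): 3, (40, 1): 2},
    "lys-c": {(12, 1): 4, (40, 3): 2},
})
best_key, best_score = scores.most_common(1)[0]
```

In this example neither digest alone gives segment 12 more than five matches, but the composite score for frame 1 of that segment is nine.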

It is also possible to take missed cleaved peptides and modified peptides into account. When the cleavage rules are used to determine the theoretical peptides, the sequence of peptides resulting from a missed cleavage can also be calculated. This allows the mass of these peptides to also be determined. During the application of the method of the present invention these masses can also be compared to experimental masses. Similarly, one can calculate the mass of a modified form of each of the peptides and check these masses also when comparing against the experimental masses.
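
A minimal sketch of both extensions follows. Missed cleavage peptides are formed by joining adjacent fully cleaved peptides, and modified masses are formed by adding a fixed mass shift. Methionine oxidation (about +15.9949 Da) is used purely as an example modification; the text does not name any particular modification, and the function names are hypothetical.

```python
def with_missed_cleavages(peptides, max_missed=1):
    """peptides: fully cleaved peptides in sequence order.
    Returns every run of up to max_missed + 1 adjacent peptides, joined."""
    out = []
    for i in range(len(peptides)):
        last = min(i + max_missed + 1, len(peptides))
        for j in range(i + 1, last + 1):
            out.append("".join(peptides[i:j]))
    return out

OXIDATION = 15.9949  # Da; example modification only (methionine oxidation)

def candidate_masses(peptide, base_mass):
    """Unmodified mass plus one entry per possible number of oxidised methionines."""
    masses = [base_mass]
    for k in range(1, peptide.count("M") + 1):
        masses.append(base_mass + k * OXIDATION)
    return masses
```

Each extra candidate mass also raises the chance of a random match, so in practice the number of missed cleavages and modifications considered would be kept small.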

The method can be automated by writing an application or script to take a series of peak lists and submit each in turn to a search against the genome. The results of this search can be databased and reviewed at a later time to determine the correct hit.
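
Such a script might look like the sketch below, which stores results in an SQLite database for later review. The `search` argument stands for whatever routine performs the genome search; its signature, the table layout and the function name `run_batch` are assumptions made for this illustration and are not part of the described implementation.

```python
import sqlite3

def run_batch(peak_lists, search, db_path=":memory:"):
    """peak_lists: {sample name: [experimental masses]}.
    search: hypothetical callable taking a list of masses and returning
            (segment, frame, matches) tuples.
    Stores every hit in a 'hits' table and returns the open connection."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS hits "
        "(sample TEXT, segment INTEGER, frame INTEGER, matches INTEGER)"
    )
    for sample, masses in peak_lists.items():
        for segment, frame, matches in search(masses):
            conn.execute(
                "INSERT INTO hits VALUES (?, ?, ?, ?)",
                (sample, segment, frame, matches),
            )
    conn.commit()
    return conn
```

The stored hits can then be queried per sample, for example ordered by the number of matches, when the results are reviewed.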

The present invention works particularly well with small genomes such as bacterial and yeast genomes or other eukaryote genomes that have few introns and small amounts of non-coding DNA.

The method can also be used for the detection of pseudogenes, which are versions of genes which have become defunct, and for identifying “protein families” of similar proteins. When a protein from a family of proteins is detected, a number of regions having a large number of matches may be identified. This indicates that the proteins may be members of the same protein family which may, for example, be expressed in different tissues.

It will be appreciated by persons skilled in the art that numerous variations and/or modifications may be made to the invention as shown in the specific embodiments without departing from the spirit or scope of the invention as broadly described. The present embodiments are, therefore, to be considered in all respects as illustrative and not restrictive.

1. A method of identifying one or more proteins in an unannotated DNA sequence, the method comprising: (a) dividing the DNA sequence into a plurality of sequence fragments each fragment being of substantially the same length and from about 300 to 5000 base pairs long; (b) performing a six frame translation of each of the DNA sequence fragments to obtain six translated amino acid sequence fragments for each DNA sequence fragment; (c) subjecting each of the translated sequence fragments to theoretical digestion to obtain a plurality of cleaved peptide sequences; and (d) comparing experimental empirical data for peptide fragments from a protein digested in the same manner as the theoretical digestion at step (c) with the theoretical data generated in step (c) for each of the translated sequence fragments to identify one or more translated sequence fragments which include a significant number of peptides present in the digested protein.
2. The method of claim 1 wherein the step (a) of dividing the DNA sequence into a plurality of sequence fragments is performed before the step (b) of performing the six frame translation.
3. The method of claim 1 wherein the step (a) of dividing the DNA sequence into a plurality of sequence fragments is performed after the step (b) of performing the six frame translation.
4. The method of claim 1 wherein theoretically generated peptide masses are compared to the masses of the peptides experimentally generated by the digested protein and the sequence fragment which has the greatest number of theoretical peptide masses correlating to the empirical data is identified as indicating the likely location of the protein of interest in the DNA sequence.
5. The method of claim 1 wherein the masses of the peptides experimentally generated from the digested protein are determined by mass spectrometry.
6. The method of claim 1 wherein the DNA sequence is duplicated into a duplicate and an original and the original and duplicate are split in such a manner that the sequence fragments from the duplicate overlap divisions in the original genome sequence.
7. The method of claim 1 wherein the sequence fragments are from 800 to 1200 base pairs long.
8. The method of claim 7 wherein the sequence fragments are around 1000 to 1050 bases long.
9. The method of claim 1 wherein steps (c) and (a) are performed twice using different enzymes and data from the two digests is combined and analysed to identify the protein coding region of interest.
10. The method of claim 1 wherein in the theoretical digest of step (c) all theoretical peptides which contain a stop codon are discarded.
11. The method of claim 1 wherein the fragments are numbered so that an overlapping fragment is numbered n where the fragments it overlaps are numbered n−1 and n+1, where n is an integer.
12. A method of identifying one or more proteins in an unannotated DNA sequence, the method comprising: (a) performing a six frame translation of a DNA sequence to provide six translated amino acid sequences; (b) dividing the six translated amino acid sequences into a plurality of fragments, each fragment comprising 100-1666 amino acids; (c) subjecting each of the fragments to theoretical digestion to obtain a plurality of cleaved peptide sequences; and (d) comparing experimental empirical data for peptide fragments from a protein digested in the same manner as the theoretical digestion at step (c) with theoretical data generated in step (c) for each of the fragments to identify one or more fragments which include a significant number of peptides present in the empirically digested protein.
13. The method of claim 12 wherein each of the six translated amino acid sequences is duplicated into an original and a duplicate copy and the original and duplicate of each are split in such a manner that the sequence fragments from the original overlap divisions in the original sequence.
14. The method of claim 12 wherein theoretically generated peptide masses are compared to the masses of the peptides experimentally generated by the digested protein and the sequence fragment which has the greatest number of theoretical peptide masses correlating to the empirical data is identified as indicating the likely location of the protein of interest in the DNA sequence.
15. The method of claim 12 wherein step (c) is performed twice using different enzymes and data from the two digests is combined and analysed to identify a protein coding region of interest.